202211.00392v1 


chinaXiv 


RESEARCH PAPER 


Deep Learning with Heterogeneous Graph Embeddings 
for Mortality Prediction from Electronic Health Records 


Tingyi Wanyan, Hossein Honarvar, Ariful Azad, Ying Ding & Benjamin S. Glicksberg†


Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, New York 10065, USA
School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN 47405-7000, USA
Dell Medical School, University of Texas at Austin, Austin, Texas 78701-1996, USA
School of Informatics, University of Texas at Austin, Austin, Texas 78712-1139, USA
Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York 10065, USA


Keywords: Electronic health records (EHRs); Convolutional Neural Networks (CNNs); Heterogeneous Graph 
Model (HGM); Machine learning; Deep learning 


Citation: Wanyan T.Y., et al.: Deep learning with heterogeneous graph embeddings for mortality prediction from electronic health 
records. Data Intelligence 3(3), 329-339 (2021). doi: 10.1162/dint_a_00097 
Received: December 24, 2020; Revised: April 2, 2021; Accepted: April 30, 2021 


ABSTRACT 


Computational prediction of in-hospital mortality in the setting of an intensive care unit can help clinical 
practitioners to guide care and make early decisions for interventions. As clinical data are complex and 
varied in their structure and components, continued innovation of modelling strategies is required to identify 
architectures that can best model outcomes. In this work, we trained a Heterogeneous Graph Model (HGM) 
on electronic health record (EHR) data and used the resulting embedding vector as additional information 
added to a Convolutional Neural Network (CNN) model for predicting in-hospital mortality. We show that 
the additional information provided by including time as a vector in the embedding captured the relationships 
between medical concepts, lab tests, and diagnoses, which enhanced predictive performance. We found that 
adding HGM to a CNN model increased the mortality prediction accuracy by up to 4%. This framework serves
as a foundation for future experiments involving different EHR data types on important healthcare prediction
tasks.


† Corresponding author: Benjamin S. Glicksberg (Email: benjamin.glicksberg@mssm.edu; ORCID: 0000-0003-4515-8090).


1. INTRODUCTION


Timely prediction of in-hospital mortality within intensive care units (ICU) is beneficial [1, 2] for 
practitioners to tailor care and allow for earlier interventions to prevent deterioration [3, 4]. Electronic health 
record (EHR) data consist of information relating to patient encounters with a health system, such as disease 
diagnoses, vital signs, and medications, among others [5, 6], and are often used for machine learning
(ML) predictions across tasks in the biomedical domain, including mortality prediction [7, 8, 9]. The
inherent complexity of EHR data often requires advanced modeling frameworks to achieve robust performance
on these tasks. A common modeling approach in EHR research is to use a two-dimensional convolutional
neural network (CNN) with one dimension as time and the other as clinical features [10, 11, 12]. In
healthcare-related CNN models, various medical features are normally concatenated and used directly as
inputs to create embeddings [13, 14, 15]. This form of feature representation can be powerful, but it
disregards the graphical structure and interconnectivity between medical concepts [16, 17], which can degrade
CNN performance, especially since EHR data are often sparse due to missingness [10].


In this work, we proposed a Heterogeneous Graph Model (HGM) to create a patient embedding vector, 
which better accounts for missingness in data for training a CNN model. The HGM model captures the 
relationships between different medical concept types (e.g., diagnoses and lab tests) due to its graphical 
structure. This relational representation facilitates capturing more complex patient patterns and encoding 
similarities. 


2. METHODOLOGY 
2.1 Data Set 


We conducted our experiments on de-identified EHR data from MIMIC-III [18]. This data set contains 
various clinical data relating to patient admission to ICU, such as disease diagnoses in the form of 
International Classification of Diseases (ICD)-9 codes, and lab test results as detailed in Supplementary 
Materials. We collected data for 5,956 patients, extracting lab tests every hour from admission. In total, we
observed 409 unique lab tests and 3,387 unique disease diagnoses. The diagnoses were obtained
as ICD-9 codes and represented using one-hot encoding, where one indicates that the patient has the
disease and zero that the patient does not. We binned the lab test events into windows of 6, 12, 24, and 48 hours prior
to patient death or discharge from ICU. From these data, we performed mortality predictions, which we
evaluated with 10-fold cross-validation.
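The windowing step can be illustrated with a minimal sketch. It assumes lab events arrive as (hour since admission, lab name, value) tuples and that repeated measurements within the same hour are averaged; the function name `bin_last_hours` and the toy events are hypothetical and not taken from the MIMIC-III schema.

```python
from collections import defaultdict

def bin_last_hours(events, end_hour, window):
    """Average lab values per one-hour bin over the last `window` hours.

    events: iterable of (hour_since_admission, lab_name, value) tuples.
    end_hour: time of death or ICU discharge, in hours since admission.
    Returns a list of `window` dicts; index 0 is the oldest hour in the window.
    """
    sums = [defaultdict(float) for _ in range(window)]
    counts = [defaultdict(int) for _ in range(window)]
    for hour, lab, value in events:
        hours_before_end = end_hour - hour
        # Keep only events that fall inside the prediction window.
        if 0 <= hours_before_end < window:
            slot = window - 1 - int(hours_before_end)
            sums[slot][lab] += value
            counts[slot][lab] += 1
    return [
        {lab: sums[i][lab] / counts[i][lab] for lab in sums[i]}
        for i in range(window)
    ]

# Toy example (hypothetical values): a 6-hour window before an end time of 46 h.
events = [(1.0, "glucose", 90.0), (40.2, "glucose", 110.0),
          (40.7, "glucose", 130.0), (45.5, "lactate", 2.0)]
bins = bin_last_hours(events, end_hour=46.0, window=6)
```

Hours with no measurement come back as empty dictionaries, which is where the missingness discussed in this paper shows up.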


2.2 Convolutional Neural Network Model 


Convolutional neural networks (CNNs) are often used, and perform well, on image processing tasks [19]
due to their inherent feature extraction and abstraction ability, which increases accuracy on classification
tasks. Studies have also demonstrated encouraging successes in using CNNs for EHR analyses.
In this work, we used a standard CNN model as the baseline.


Since CNNs typically require two-dimensional inputs, we treated time as the horizontal dimension and
medical events as the vertical dimension. For the time dimension, we recorded every event in one-hour
bins relative to the patient's death or discharge time. In this model, the vertical dimension
was constructed by concatenating two medical event vectors: lab tests and diagnoses. Every entry of the
lab test vector recorded the value of a specific lab test by hour. We pre-processed the lab tests by
retaining only values between the 0.5th and 99.5th percentiles to remove inaccurate measurements. We then
normalized the data by calculating the standard score (z-score) and imputed missing lab values with zeros.
For the diagnosis vector, the i-th entry is 1 if the i-th diagnosis is observed and 0 otherwise. We treated mortality
prediction as a binary classification, for which we used a softmax layer with two dimensions and
cross-entropy loss.
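The lab pre-processing described above (percentile filtering, z-scoring, zero imputation) could look roughly like the following sketch. Two choices here are assumptions rather than details stated in the text: percentiles are computed with linear interpolation, and values outside the percentile range are treated like missing values. The helper names are hypothetical.

```python
import statistics

def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def preprocess_lab(values, low_q=0.5, high_q=99.5):
    """Z-score one lab test; `values` holds floats or None for missing hours.

    Out-of-range and missing entries are set to 0.0, which equals the mean
    after z-scoring (assumption: outliers are handled like missing values).
    """
    observed = sorted(v for v in values if v is not None)
    if not observed:
        return [0.0] * len(values)
    lo, hi = percentile(observed, low_q), percentile(observed, high_q)
    kept = [v for v in observed if lo <= v <= hi]
    if not kept:
        return [0.0] * len(values)
    mean = statistics.mean(kept)
    std = statistics.pstdev(kept) or 1.0  # guard against zero variance
    return [
        (v - mean) / std if v is not None and lo <= v <= hi else 0.0
        for v in values
    ]

# Toy example: one missing hour and one implausible measurement.
values = [1.0, 2.0, 3.0, None, 1000.0]
scaled = preprocess_lab(values)
```

With only a handful of observations the interpolated cut-offs trim both extremes, so this toy input keeps just the two middle values; on a full cohort the filter removes only about 1% of measurements.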


2.3 Heterogeneous Graph Model 


The features used in the baseline CNN model are essentially raw data concatenated together, which does
not capture the relationships between medical concepts. We used an HGM to capture these inherent
relationships by creating three different types of nodes: patient, lab test, and diagnosis. These
node types are connected by two relation types: tested and diagnosed. These can be represented with two
triples:


Patient lab: {patient, tested, lab} 
Patient diagnosis : {patient, diagnosed, diagnosis} 


The tested relationship indicates whether a specific lab test was given to a patient at a specific time, and the
diagnosed relationship indicates whether a patient was diagnosed with a disease.


To represent the lab test and diagnosis node types, we used multi-hot encoded vectors $X_l \in \{0,1\}^{d_l}$ and
$X_d \in \{0,1\}^{d_d}$, where $d_l$ and $d_d$ are the numbers of unique lab tests and diagnoses, and the i-th entry
takes the value 1 if the corresponding lab test was performed or the corresponding diagnosis was given, and 0 otherwise.
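As a small illustration of the node representation and the two triple types, the sketch below builds multi-hot vectors and relation triples for a single patient. The vocabularies, the patient identifier, and the helper `multi_hot` are hypothetical and far smaller than the 409 lab tests and 3,387 diagnoses in the data set.

```python
def multi_hot(observed, vocabulary):
    """Return a 0/1 vector with a 1 at the index of every observed item."""
    index = {name: i for i, name in enumerate(vocabulary)}
    vec = [0] * len(vocabulary)
    for name in observed:
        vec[index[name]] = 1
    return vec

# Illustrative vocabularies (assumed for this example only).
lab_vocab = ["glucose", "lactate", "creatinine"]
dx_vocab = ["038.9", "428.0", "584.9", "518.81"]  # ICD-9 codes

patient_labs = ["lactate", "glucose"]
patient_dx = ["428.0"]

x_l = multi_hot(patient_labs, lab_vocab)
x_d = multi_hot(patient_dx, dx_vocab)

# The two relation types expressed as (head, relation, tail) triples.
triples = (
    [("patient_1", "tested", lab) for lab in patient_labs]
    + [("patient_1", "diagnosed", dx) for dx in patient_dx]
)
```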


2.3.1 Node Embeddings 


For capturing the relations between different medical events related to a patient, we first utilized the 
TransE model to project different types of nodes into the same latent space, and then classified those nodes 
that were connected as a similar group and the disconnected nodes as a dissimilar group. 


The TransE model uses a set of 1) projection matrices and 2) relation vectors. After initialization, projections
and translations are optimized end-to-end. Heterogeneous nodes $X_p$, $X_l$, and $X_d$ are projected into a
shared latent space with trainable projection matrices $W_p$, $W_l$, and $W_d$ using the nonlinear mappings with
Equation (1):

$$
\begin{aligned}
c_p &= \sigma(W_p \cdot X_p) \\
c_l &= \sigma(W_l \cdot X_l) \\
c_d &= \sigma(W_d \cdot X_d)
\end{aligned}
\tag{1}
$$

where $\sigma$ is a non-linear activation function and $c_p$, $c_l$, and $c_d$ are the latent representations of each type of
node. Despite the fact that the EHR uses different dimensions for different data types $X_p$, $X_l$, and $X_d$, all node
types are projected into the same latent space. Then we applied translation operations to link these different
types of nodes with Equation (2):


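$$
\begin{aligned}
c_l^p &= c_l + r_{pl} \\
c_d^p &= c_d + r_{pd}
\end{aligned}
\tag{2}
$$


where $r_{pl}$ and $r_{pd}$ are the relation vectors connecting patients to lab tests and diagnoses, respectively,
and $c_l^p$ and $c_d^p$ are the semantically translated representations of the lab test and diagnosis nodes in the
latent space of the patient embedding $c_p$.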
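The projection and translation steps can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the training code: the activation is taken to be a sigmoid, the translation is taken to be additive in the spirit of TransE, the weights are random rather than learned, and all dimensions and names are chosen for the example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def project(W, x):
    """Nonlinear projection c = sigma(W . x); W is a list of rows."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W]

def translate(c, r):
    """Shift an embedding by a relation vector (assumed additive)."""
    return [ci + ri for ci, ri in zip(c, r)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
latent_dim = 4                      # hypothetical latent size

x_p = [0.3, -1.2, 0.0]              # patient: averaged lab values (toy)
x_l = [1, 0, 1]                     # lab tests: multi-hot (toy)
x_d = [0, 1, 0, 0, 1]               # diagnoses: multi-hot (toy)

# One projection matrix per node type, each mapping to the same latent size.
W_p = random_matrix(latent_dim, len(x_p), rng)
W_l = random_matrix(latent_dim, len(x_l), rng)
W_d = random_matrix(latent_dim, len(x_d), rng)

# One relation vector per relation type.
r_pl = [rng.uniform(-0.1, 0.1) for _ in range(latent_dim)]
r_pd = [rng.uniform(-0.1, 0.1) for _ in range(latent_dim)]

c_p, c_l, c_d = project(W_p, x_p), project(W_l, x_l), project(W_d, x_d)
c_l_p = translate(c_l, r_pl)        # lab embedding moved toward patient space
c_d_p = translate(c_d, r_pd)        # diagnosis embedding moved toward patient space
```

Although the three inputs have different lengths, every output has length `latent_dim`, which is what allows the later dot-product comparisons between node types.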


2.3.2 Optimization Model 


For training the HGM, we applied a skip-gram optimization model, which increases the proximity 
between embedding points whose corresponding graph nodes are often connected after the projection and 
translation operations (Equation (3)): 


$$
\max \sum_{u \in V} \sum_{t \in T_V} \log \Pr(N_t(u) \mid u) \tag{3}
$$

where $N_t(u)$ are the neighborhood vertices of center node $u$, and $t \in T_V$ is the node type. Here, we learned
the node embeddings by maximizing the probability of correctly predicting the patient node’s associated 
lab tests and diagnoses. The prediction probability is modeled as a softmax function with Equation (4): 


$$
\Pr(c_t \mid f(u)) = \frac{e^{\vec{c}_t \cdot \vec{u}}}{Z_u} \tag{4}
$$

where $\vec{u}$ is the latent representation of patient $u$, $\vec{c}_t$ is the latent representation of lab and diagnosis
neighbors of node $u$, and $\vec{c}_t \cdot \vec{u}$ is the inner product of the two embedding vectors representing their
similarity. $Z_u$ is the normalization term $Z_u = \sum_{v \in V} e^{\vec{v} \cdot \vec{u}}$ that is a sum over all vertices $V$, each of which is
represented as $\vec{v}$, including all node types. Therefore, Equation (3) is simplified to Equation (5):

$$
\max \sum_{u \in V} \sum_{t \in T_V} \sum_{c_t \in N_t(u)} \left[ \vec{c}_t \cdot \vec{u} - \log Z_u \right] \tag{5}
$$

Numerical computation of $Z_u$ is intractable for large-scale graphs, so we adopted a negative sampling
strategy to approximate the normalization factor. We eventually used the following optimization function
(Equation (6)):


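$$
\mathcal{L} = \sum_{t \in T_V} \sum_{u \in V} \sum_{c_t \in N_t(u)} \left[ \log \sigma(\vec{c}_t \cdot \vec{u}) + \sum_{j=1}^{K} \mathbb{E}_{c_j \sim P_n(c)} \log \sigma(-\vec{c}_j \cdot \vec{u}) \right] \tag{6}
$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function, which operates on the dot product $\vec{c}_t \cdot \vec{u}$,
$K$ is the number of negative samples, and $P_n(c)$ is the negative sampling distribution.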
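For a single (center, neighbor) pair, the negative-sampling objective in Equation (6) reduces to a log-sigmoid of the positive dot product plus K log-sigmoids of negated dot products with sampled negatives. The sketch below evaluates that per-pair term; the expectation is approximated by the K drawn samples, and the function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pair_objective(u, c_pos, negatives):
    """Per-pair term: log s(c_pos . u) + sum_j log s(-c_j . u).

    u: patient embedding, c_pos: embedding of a true neighbor,
    negatives: K embeddings drawn from the negative sampling distribution.
    """
    positive_term = log_sigmoid(dot(c_pos, u))
    negative_term = sum(log_sigmoid(-dot(c_neg, u)) for c_neg in negatives)
    return positive_term + negative_term

# Toy check: an aligned neighbor with opposed negatives scores higher
# than an opposed neighbor with aligned negatives.
u = [1.0, 0.5]
good = pair_objective(u, [1.0, 0.5], [[-1.0, -0.5], [-0.5, -1.0]])
bad = pair_objective(u, [-1.0, -0.5], [[1.0, 0.5], [0.5, 1.0]])
```

Maximizing this quantity pulls connected nodes together and pushes sampled non-neighbors apart without ever computing the full normalization term.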


For training the HGM, we performed heterogeneous neighborhood sampling based on one-hop connectivity
and picked the Patient node as the center node, since it has one-hop connections to both Diagnoses and
Lab_test nodes. Specifically, for each training center Patient node, we uniformly sampled 10 directly
connected (one-hop) Diagnoses nodes and 10 directly connected (one-hop) Lab_test nodes. From these 10 sampled
Diagnoses nodes, we sampled another 10 Patient nodes, each having connections with each of the prior
10 Diagnoses nodes. In this way, we connected the center Patient node with other similar Patient nodes through
their common diagnoses. We also sampled the Patient node that corresponds to the next hour
of the center Patient node. For negative sampling, we uniformly sampled from all Diagnoses
nodes and Lab_test nodes that do not have one-hop connections with the center Patient node. We
then projected these different nodes into the same latent space through the TransE model. After unifying the
embeddings for different node types, each concept is represented as a point in a Euclidean space. In this
space, we can measure the similarity between any two vectors using the dot product.
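The sampling procedure can be sketched as follows. Several details are assumptions made for the example: the graph is stored as dictionaries of sets, peer patients are those sharing at least one sampled diagnosis with the center patient, the number of negatives is a free parameter, and the next-hour patient node is omitted. The function and node names are hypothetical.

```python
import random

def sample_neighborhood(center, patient_dx, patient_labs, all_dx, all_labs,
                        k=10, n_neg=10, rng=random):
    """Sample one-hop positives, peer patients, and negatives for `center`."""
    dx = sorted(patient_dx[center])
    labs = sorted(patient_labs[center])
    pos_dx = rng.sample(dx, min(k, len(dx)))
    pos_labs = rng.sample(labs, min(k, len(labs)))

    # Peer patients: share at least one sampled diagnosis with the center.
    sampled = set(pos_dx)
    peers = sorted(p for p, d in patient_dx.items()
                   if p != center and d & sampled)
    pos_patients = rng.sample(peers, min(k, len(peers)))

    # Negatives: diagnosis and lab nodes with no one-hop edge to the center.
    neg_pool = sorted((set(all_dx) - patient_dx[center])
                      | (set(all_labs) - patient_labs[center]))
    negatives = rng.sample(neg_pool, min(n_neg, len(neg_pool)))
    return pos_dx, pos_labs, pos_patients, negatives

# Toy graph with three patients.
patient_dx = {"p1": {"dx_a", "dx_b"}, "p2": {"dx_b"}, "p3": {"dx_c"}}
patient_labs = {"p1": {"lab_x"}, "p2": {"lab_x", "lab_y"}, "p3": {"lab_z"}}
all_dx = {"dx_a", "dx_b", "dx_c"}
all_labs = {"lab_x", "lab_y", "lab_z"}

pos_dx, pos_labs, pos_patients, negatives = sample_neighborhood(
    "p1", patient_dx, patient_labs, all_dx, all_labs, rng=random.Random(0))
```

In this toy graph, patient p2 is linked to p1 through the shared diagnosis dx_b, while dx_c, lab_y, and lab_z form the negative pool.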


2.3.3 HGM Embeddings with CNN Model 


The HGM embedding vector encodes not only a patient's information, but also their relation with
diagnoses, lab tests, and subsequent lab test results in time. The patient node is represented as a
vector $X_p \in \mathbb{R}^{d_l}$, where $d_l$ is the number of lab tests, containing the numerical lab values averaged at that time step. We
concatenated the resulting embedding vectors with the vertical feature dimension of the baseline CNN to
form a final feature vector for every hour, and used these new features as the CNN input to predict
mortality. In addition, since we encoded time as a relation type, we can infer the embedding vector of
time steps with missing data based on information from the previous hour. We visualized this procedure
in Figure 1.
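A minimal sketch of the concatenation step is shown below. It assumes one raw lab vector and one HGM embedding per hour, and it handles hours with no embedding by simply reusing the previous hour's embedding, which is a simplification of the time-relation inference described above. The function name and toy numbers are hypothetical.

```python
def build_cnn_input(raw_by_hour, emb_by_hour, emb_dim):
    """Stack [raw lab features | HGM embedding] for every hour.

    raw_by_hour: list of raw lab feature vectors, one per hour.
    emb_by_hour: list of embedding vectors, or None where data are missing.
    Returns one concatenated feature vector per hour.
    """
    columns = []
    previous = [0.0] * emb_dim          # fallback before any embedding exists
    for raw, emb in zip(raw_by_hour, emb_by_hour):
        if emb is None:
            emb = previous              # assumption: carry the last embedding forward
        columns.append(list(raw) + list(emb))
        previous = emb
    return columns

# Toy example: three hours, two raw lab features, three embedding dimensions.
raw = [[0.1, -0.3], [0.0, 0.0], [1.2, 0.4]]
emb = [[0.5, 0.6, 0.7], None, [0.2, 0.1, 0.0]]
matrix = build_cnn_input(raw, emb, emb_dim=3)
```

Each row of `matrix` is one time step; transposing it gives the layout with time on the horizontal axis and features on the vertical axis used by the CNN.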


Figure 1. (A) A graphical representation of the HGM for p: patient, l: lab test, and d: diagnosis data. (B) All graph
nodes in (A) have a corresponding vector like those shown in (B). The vector representations can be projected into
a shared space with the TransE method, and this projection is optimized for retaining relations in the original data
in the embedding via skip-gram optimization. Finally, these vectors are concatenated into the CNN model for
mortality prediction.


3. EXPERIMENTS


We aimed to predict mortality 6, 12, 24, and 48 hours prior to death and/or discharge. The CNN model
introduced in Section 2.2 was used for prediction. The architecture has two convolutional
layers, with a 5x2 filter in the first layer and a 3x2 filter in the second, and a max-pooling layer
between them. Following the convolutional layers is a fully connected layer with a latent dimension
of 100 neurons. The final layer is a sigmoid layer for predicting the output probability. We compared three
different scenarios to test the impact of adding HGM embedding vectors as additional features to the 
framework: 


• HGM: Embed patient labs and diagnosis raw data
• CNN: Use raw lab test feature
• HGM+CNN: Concatenate the HGM patient embedding vector and the raw lab test feature vector
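To get a feel for the architecture described before this list, the following sketch traces feature-map sizes through the two convolutions and the pooling layer. Only the filter sizes come from the text. The rest is assumed for illustration: the 5 and 3 span the feature axis and the 2 spans the time axis, convolutions use stride 1 without padding, pooling is 2x2, and the input is 409 lab features over a 24-hour window.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def cnn_shapes(n_features, n_hours, pool=2):
    """Return (features, hours) sizes after conv1, max pooling, and conv2."""
    h, w = conv_out(n_features, 5), conv_out(n_hours, 2)   # first filter: 5x2
    after_conv1 = (h, w)
    h, w = h // pool, w // pool                            # assumed 2x2 max pooling
    after_pool = (h, w)
    h, w = conv_out(h, 3), conv_out(w, 2)                  # second filter: 3x2
    after_conv2 = (h, w)
    return after_conv1, after_pool, after_conv2

shapes_24h = cnn_shapes(n_features=409, n_hours=24)
shapes_6h = cnn_shapes(n_features=409, n_hours=6)
```

Under these assumptions the 6-hour window shrinks to a single column along the time axis after the second convolution, which shows how little temporal context the shortest window leaves for the fully connected layer.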


For baselines, we also compared our models with three traditional machine learning models:
Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF). We used
AUROC and AUPRC scores as the primary performance metrics. The results are tabulated in Tables 1 and
2, and the AUROC and AUPRC curves for these tasks are shown in Figure 2.
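Both metrics can be computed directly from labels and predicted scores. The sketch below uses the pairwise-ranking definition of AUROC and average precision as an estimator of AUPRC; the text does not say which AUPRC estimator was used, so that choice is an assumption, and tied scores are not specially handled in the average-precision function.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy example: 1 = died in hospital, 0 = discharged.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
roc = auroc(labels, scores)
ap = average_precision(labels, scores)
```

AUPRC is the more informative of the two when deaths are much rarer than discharges, because it ignores the large pool of correctly ranked negatives.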


Table 1. Mortality prediction AUROC evaluation.


                      Hours prior to death
Model       6             12            24            48
LR          0.689±0.01    0.691±0.01    0.672±0.02    0.675±0.02
SVM         0.654±0.01    0.661±0.02    0.652±0.01    0.653±0.01
RF          0.667±0.02    0.671±0.01    0.663±0.02    0.654±0.02
HGM         0.714±0.02    0.715±0.03    0.653±0.03    0.641±0.03
CNN         0.782±0.01    0.771±0.02    0.775±0.01    0.767±0.01
HGM+CNN     0.800±0.01    0.791±0.02    0.796±0.01    0.771±0.01


Note: Mean values from 10-fold cross validation with standard deviation for confidence intervals.
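The mean ± standard deviation entries can be produced by a procedure along the following lines. This is a sketch only: the shuffling seed, the use of the sample standard deviation, and the example per-fold scores are assumptions, not values from the experiments.

```python
import random
import statistics

def kfold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k shuffled, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def summarize(fold_scores):
    """Format per-fold scores as mean±std (sample standard deviation assumed)."""
    mean = statistics.mean(fold_scores)
    std = statistics.stdev(fold_scores)
    return f"{mean:.3f}±{std:.2f}"

# Split the 5,956 patients into 10 folds.
splits = list(kfold_indices(5956, k=10))

# Made-up per-fold scores, used only to show the output format.
example = summarize([0.80, 0.79, 0.81])
```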


Table 2. Mortality prediction AUPRC evaluation.


                      Hours prior to death
Model       6             12            24            48
LR          0.545±0.01    0.556±0.02    0.542±0.01    0.539±0.01
SVM         0.487±0.02    0.501±0.01    0.498±0.02    0.487±0.02
RF          0.512±0.02    0.523±0.02    0.510±0.01    0.503±0.01
HGM         0.557±0.02    0.559±0.02    0.578±0.02    0.567±0.03
CNN         0.590±0.01    0.577±0.02    0.589±0.01    0.585±0.02
HGM+CNN     0.601±0.01    0.600±0.01    0.604±0.01    0.617±0.02


Note: Mean values from 10-fold cross validation with standard deviation for confidence intervals.


Figure 2. Evaluation of AUROC and AUPRC curves for HGM, CNN, and HGM+CNN models.


The testing results show that the HGM+CNN outperforms all baseline models and both the basic HGM 
and CNN models, indicating the additional information added from the HGM patient embeddings increases 
the accuracy of predicting in-patient mortality. The prediction accuracy of using different hours prior to 
death and/or discharge does not vary by much, indicating that different time windows do not have a major 
impact on the result for this particular task and modeling strategy. The prediction accuracy in the CNN 
model drops by 1% in the case of six hours prior to death and/or discharge, but not in the other two models, 
indicating that using the embedding features from HGM model is slightly more robust than the raw data. 


4. DISCUSSION AND CONCLUSION 


In this work, we proposed a method to incorporate patient embedding vector from an HGM model into 
a CNN model in order to provide more information via interconnectivity between different clinical concepts. 
We assessed the value of this implementation on a task of predicting mortality in EHR data. The results of
our experiment show that adding the patient embedding vector, which
is pretrained with the HGM, yields superior performance compared with using only raw features as input to the CNN model and
with traditional ML models. In part, this is because the HGM embedding vector captures
additional relational information between different medical concepts, thus providing additional information
to the CNN model.


Furthermore, we observed that concatenating the HGM embedding vector with diagnosis feature vectors 
did not increase the accuracy versus using the concatenation between raw lab test and diagnosis feature 
vectors. This finding indicates that the raw lab test feature vector can provide unique information for CNN 
to utilize. At the same time, this finding indicates that the embedded patient vector from HGM model could 
lose some information from the raw lab test feature along the process of projecting these data into a low 
dimensional latent space. By concatenating all feature vectors, we aimed to preserve the information from
different data points, which helps to achieve higher mortality prediction accuracy. There are a few limitations
to this study. First, these findings need to be replicated in another data set. Second, exploring more baselines
beyond the ones shown in this work would be beneficial for evaluating the overall improvements. We hope the
findings from this work can be extended in future directions that add more EHR node types and time
components on a variety of other important health-related predictive tasks.